We began with linear regression models (univariate and multivariate) to examine how well they fit our data. Linear regression is a conventional, widely used approach that may explain the association with tip well, so we tested it first. To strengthen the linear model, we also used lasso, ridge, and principal component analysis (PCA). In addition, we fit decision tree and random forest regression models. One advantage of decision trees is their capacity to model non-linear relationships. According to our EDA, trip duration, distance, and fare are linearly related over short distances, but this relationship weakens over longer distances as other factors come into play. Consequently, the relationship with tip may in fact be non-linear, too.
Before developing our models, we prepared the data by applying one-hot encoding, creating training and testing sets, and scaling the numerical variables.
We used one-hot encoding to convert factor columns to numerical columns: each level of a factor becomes its own 0/1 indicator column.
## Rows: 108,332
## Columns: 48
## $ pickup_time <chr> "2022-05-31 20:25:41", "2022-05-31 20:21:00…
## $ dropoff_time <chr> "2022-05-31 20:48:22", "2022-05-31 20:59:50…
## $ trip_distance <dbl> 11.00, 18.18, 10.60, 10.40, 12.33, 6.88, 18…
## $ fare_amount <dbl> 32.0, 52.0, 31.0, 30.0, 37.5, 21.5, 52.0, 5…
## $ tip_amount <dbl> 2.00, 12.37, 10.65, 12.10, 11.96, 6.62, 12.…
## $ tolls_amount <dbl> 6.55, 6.55, 6.55, 6.55, 6.55, 6.55, 6.55, 6…
## $ tip_perc <int> 6, 24, 34, 40, 32, 31, 24, 24, 19, 36, 29, …
## $ trip_duration <int> 22, 38, 22, 22, 33, 14, 39, 23, 29, 33, 31,…
## $ VendorID_1 <int> 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ VendorID_2 <int> 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1…
## $ passenger_count_1 <int> 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ passenger_count_2 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ passenger_count_3 <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ passenger_count_4 <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ passenger_count_5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ passenger_count_6 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RatecodeID_1 <int> 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1…
## $ RatecodeID_2 <int> 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0…
## $ RatecodeID_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RatecodeID_4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RatecodeID_5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_Bronx <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_Brooklyn <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_EWR <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_Manhattan <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PULocation_Queens <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ `PULocation_Staten Island` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DOLocation_Bronx <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DOLocation_Brooklyn <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DOLocation_EWR <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DOLocation_Manhattan <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DOLocation_Queens <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `DOLocation_Staten Island` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Fri <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Mon <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Sat <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Sun <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Thu <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_Tue <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day_Wed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PU_time_of_day_Afternoon <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PU_time_of_day_Evening <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ PU_time_of_day_Morning <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PU_time_of_day_Night <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DO_time_of_day_Afternoon <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DO_time_of_day_Evening <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DO_time_of_day_Morning <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DO_time_of_day_Night <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
After one-hot encoding (OHE), the dataset has 48 columns.
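As a minimal sketch of the encoding step, base R's `model.matrix()` can expand factor columns into 0/1 indicator columns. The toy data frame and its values are made up for illustration, and the source does not say which function was actually used; only the column names mirror the dataset.

```r
# Toy data (hypothetical values); column names mirror the dataset above.
trips <- data.frame(
  trip_distance = c(11.0, 18.18, 10.6),
  VendorID      = factor(c(1, 2, 1)),
  day           = factor(c("Tue", "Tue", "Wed"))
)

# One indicator column per factor level: "- 1" drops the intercept, and the
# contrasts argument keeps every level of every factor (no reference level).
factor_cols <- c("VendorID", "day")
encoded <- model.matrix(
  ~ . - 1,
  data = trips,
  contrasts.arg = lapply(trips[factor_cols], contrasts, contrasts = FALSE)
)
encoded <- as.data.frame(encoded)
print(encoded)
```

With this approach the indicator columns are named `VendorID1`, `dayTue`, and so on, rather than `VendorID_1` and `day_Tue` as in the output above, so a renaming step would be needed to match.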
Because the numerical variables are measured on different scales, their magnitudes are not directly comparable, so we must scale them. To make them comparable, we compute the mean and standard deviation of each numerical column.
To avoid leaking information from the test data into the training process, the train-test split should be made before (most) modeling steps. We randomly divided the dataset into a 70% training set and a 30% test set.
After the split, the training dataset has 75,799 rows and the testing dataset has 32,533 rows.
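The split and scaling steps can be sketched as follows. This is one common arrangement (split first, then standardize with the training set's mean and standard deviation); the source does not state the exact order used, and the seed, toy data, and variable names `train_idx`, `mu`, and `sigma` are assumptions for illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Toy stand-in for the trips data (hypothetical values).
n <- 1000
trips <- data.frame(
  trip_distance = rexp(n, rate = 0.2),
  fare_amount   = rnorm(n, mean = 30, sd = 8),
  tip_perc      = rnorm(n, mean = 20, sd = 7)
)

# 70/30 random split.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- trips[train_idx, ]
test  <- trips[-train_idx, ]

# Standardize predictors with the TRAINING mean and standard deviation,
# then apply the same transformation to the test set.
num_cols <- c("trip_distance", "fare_amount")
mu    <- sapply(train[num_cols], mean)
sigma <- sapply(train[num_cols], sd)
train[num_cols] <- scale(train[num_cols], center = mu, scale = sigma)
test[num_cols]  <- scale(test[num_cols],  center = mu, scale = sigma)
```

Reusing the training statistics on the test set keeps the test data from influencing the preprocessing.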
With the data prepared, the next step is to apply several regression-based methods to the training data and use the fitted models to predict the target variable.
We chose PCA as a variable reduction strategy because most of our variables were correlated with one another and there were 48 features.
Variables graph: Variables that are positively associated point to the same side of the plot. Negatively associated variables point to the graph’s opposing sides.
Observations: Even though the first three components explain 94.1% of the variance in the data, this does not necessarily mean that a good R-squared or large coefficients will result. However, it gives us a sufficient statistical basis for choosing which variables to pursue. Hence, we proceed to build our linear regression model with these variables.
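A minimal sketch of how the explained-variance figure can be read from a PCA, using base R's `prcomp()`. The data are simulated and the coefficients are arbitrary, so the numbers will not match the report; the source does not say which PCA function was used.

```r
set.seed(1)
# Hypothetical correlated predictors standing in for distance, duration, fare.
n <- 500
distance <- rexp(n, rate = 0.2)
X <- data.frame(
  trip_distance = distance,
  trip_duration = 2 * distance + rnorm(n, sd = 2),
  fare_amount   = 3 * distance + rnorm(n, sd = 3),
  tolls_amount  = rnorm(n, sd = 1)
)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)
print(round(cum_explained, 3))

# Loadings show which original variables drive each component.
print(round(pca$rotation[, 1:2], 2))
```

The cumulative total describes variance among the predictors only; it says nothing by itself about how well those predictors explain the tip.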
Based on the results of the principal component analysis, we constructed a linear model for tip percentage using the three variables that explain the most variability. Their correlations with tip percentage are: trip_distance (-0.28), trip_duration (-0.175), and tolls_amount (-0.04).
ANOVA tests on all three models
The summary of the three linear models is:
| Fit | Model Equation | R^2 | ANOVA P-value | AIC |
|---|---|---|---|---|
| 1 | tip_perc ~ trip_duration | 0.03 | – | 509392.627 |
| 2 | tip_perc ~ trip_duration+fare_amount | 0.0305 | 0.000000000145 | 513199.554 |
| 3 | tip_perc ~ trip_duration+fare_amount+trip_distance | 0.078 | 0 | 513236.639 |
Observations: Looking at the combination of p-value and R-squared, we conclude that fit 3 performs slightly better than the other two fits. Hence, we check whether model 3 improves in the absence of outliers.
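The comparison of nested linear fits can be sketched with `lm()`, `anova()`, and `AIC()`. The data below are simulated with arbitrary coefficients, so the values will differ from the table; only the model formulas follow the source.

```r
set.seed(2)
n <- 400
# Hypothetical data; variable names follow the table above.
train <- data.frame(trip_duration = rexp(n, 1 / 20))
train$trip_distance <- 0.4 * train$trip_duration + rnorm(n)
train$fare_amount   <- 2.5 * train$trip_distance + rnorm(n, sd = 2)
train$tip_perc      <- 25 - 0.1 * train$trip_duration + rnorm(n, sd = 7)

fit1 <- lm(tip_perc ~ trip_duration, data = train)
fit2 <- lm(tip_perc ~ trip_duration + fare_amount, data = train)
fit3 <- lm(tip_perc ~ trip_duration + fare_amount + trip_distance, data = train)

# Sequential F-tests: does each added predictor reduce the residual error?
print(anova(fit1, fit2, fit3))

fits <- list(fit1, fit2, fit3)
comparison <- data.frame(
  fit       = 1:3,
  r_squared = sapply(fits, function(m) summary(m)$r.squared),
  aic       = sapply(fits, AIC)
)
print(comparison)
```

For nested models fit to the same observations, R-squared cannot decrease as predictors are added, and AIC values are only comparable when the models are fit to the same rows.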
Even after treating the outliers in the model 3 fit, there is little to no difference in the results: the R-squared and MAPE values remain 0.0835 and 0.185.
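For reference, MAPE expressed as a fraction can be computed as in the sketch below. How the original analysis handled zero actual values is not stated; skipping them here is an assumption, and the `mape` helper and example vectors are hypothetical.

```r
# MAPE as a fraction (0.185 means an average error of 18.5% of the actual value).
mape <- function(actual, predicted) {
  keep <- actual != 0  # assumption: skip zero actuals, where the ratio is undefined
  mean(abs((actual[keep] - predicted[keep]) / actual[keep]))
}

actual    <- c(20, 25, 10, 0)
predicted <- c(18, 30, 11, 5)
result <- mape(actual, predicted)
print(result)
```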
Lasso regression is an L1 regularization approach that can shrink some coefficients exactly to zero (in other words, some features are dropped entirely when computing the output). As a result, it not only helps to reduce overfitting but can also aid in feature selection.
As lambda increases, the bias increases and the variance decreases, so we iterated through a set of lambda values to find the optimum. The graph below shows how lasso reduces the coefficients of unnecessary attributes to 0. For readability, only the five attributes with the largest coefficient values are labeled.
Interestingly, trip distance (short and long trips) and the standard rate applied to rides (Rate Code 1) survive the longest. It is also somewhat surprising how long the Bronx drop-off location persists.
The lambda value that minimizes the test MSE is 0.002.
There is a slight improvement in the R-squared and MAPE values compared with the base linear model; however, the R-squared is only 0.0991, which leaves a lot of room for improvement.
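To illustrate how the L1 penalty zeroes out coefficients and how a lambda grid can be scanned, here is a small base R sketch using coordinate descent with soft-thresholding. It is not the code behind the report (a package such as glmnet would normally be used, and its lambda scale may differ); the data, the grid, and the helper names `soft_threshold` and `lasso_cd` are assumptions.

```r
set.seed(3)
n <- 200; p <- 6
X <- scale(matrix(rnorm(n * p), n, p))      # standardized predictors
beta_true <- c(3, -2, 0, 0, 0, 0)           # hypothetical: only two real signals
y <- as.numeric(X %*% beta_true + rnorm(n))
y <- y - mean(y)                            # centered response, so no intercept

soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Coordinate descent for (1 / 2n) * RSS + lambda * sum(|beta|).
lasso_cd <- function(X, y, lambda, n_iter = 100) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (iter in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      partial_resid <- y - X[, -j, drop = FALSE] %*% beta[-j]
      rho <- sum(X[, j] * partial_resid) / n
      beta[j] <- soft_threshold(rho, lambda) / (sum(X[, j]^2) / n)
    }
  }
  beta
}

# Fit on one part of the data, evaluate each lambda on the held-out part.
fit_rows <- 1:140
val_rows <- 141:200
lambdas <- c(0, 0.002, 0.01, 0.05, 0.1, 0.5, 1, 5)
coefs <- lapply(lambdas, function(l) lasso_cd(X[fit_rows, ], y[fit_rows], l))
val_mse <- sapply(coefs, function(b) mean((y[val_rows] - X[val_rows, ] %*% b)^2))
n_nonzero <- sapply(coefs, function(b) sum(b != 0))
best_lambda <- lambdas[which.min(val_mse)]

print(data.frame(lambda = lambdas, val_mse = round(val_mse, 3), n_nonzero))
```

At the largest lambda every coefficient is zero, while at lambda = 0 the fit reduces to ordinary least squares.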
Ordinary least squares (OLS) finds the coefficients that best fit the data by minimizing the residual sum of squares (RSS). It also yields unbiased coefficient estimates; here, unbiased refers to the fact that OLS does not weigh the relative importance of the independent variables. For a given data set, the lowest RSS can be obtained from only one set of betas. This raises the question of whether the model with the lowest RSS is actually the better model.
In a sense, OLS offers the model with the lowest bias and the highest variance, and it becomes more complex as the number of variables rises. The OLS solution is fixed, with nothing to tune, yet we still want a model with both little bias and little variance. Ridge regression, a form of regularization, fills this gap. Because ridge regression penalizes the coefficients, the least effective ones in the estimation "shrink" the quickest. The lambda parameter (penalizing factor) can be adjusted to alter the model coefficients.
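The shrinkage effect can be seen with the closed-form ridge estimate, sketched here in base R on simulated data. The helper `ridge_coef` and the lambda values are assumptions chosen for illustration, and the penalty scale is not the same as in packaged implementations.

```r
set.seed(4)
n <- 150; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))   # standardized predictors (hypothetical data)
y <- as.numeric(X %*% c(2, -1, 0.5, 0, 0) + rnorm(n))
y <- y - mean(y)                         # centered, so the intercept is omitted

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  as.numeric(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

ols   <- ridge_coef(X, y, lambda = 0)    # lambda = 0 reproduces OLS
small <- ridge_coef(X, y, lambda = 10)
large <- ridge_coef(X, y, lambda = 1000)

# Coefficients shrink toward zero as lambda grows, but do not hit exactly zero.
print(round(rbind(ols, small, large), 4))
```

Unlike lasso, ridge keeps every variable in the model; it only reduces the size of the coefficients.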
Again, only the five attributes with the largest coefficient values are labeled for readability.
Observations: The plot shows the full path of the coefficients as they shrink towards zero as lambda increases. The pick-up locations Staten Island and Newark Airport survive the longest.
The lambda value that minimizes the test MSE is 0.978.
The plot shows that all the variables together explain ~9.30% (the ~0.0930 point on the plot) of the variance in the data. This is consistent with the R-squared value of the model.
The classification and regression tree (CART) methodology is one of the earliest methods for creating regression trees, although there are many others. A basic regression tree divides the data set into smaller subgroups and then fits a simple constant to the observations in each subgroup. The partitioning is achieved by successive binary partitions (also known as recursive partitioning) based on the different predictors.
Cost complexity criterion
To improve prediction performance on unseen data, the depth and complexity of the tree generally need to be balanced. To achieve this balance, we typically grow a very large tree and then prune it back to find an optimal subtree. We identify the best subtree by applying a cost complexity parameter (α) that penalizes our objective function for the number of terminal nodes in the tree.
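Written out, one common form of the cost complexity criterion for a regression tree is shown below. The notation is ours rather than the source's: $T$ is a subtree, $|T|$ its number of terminal nodes, $R_m$ the region of the $m$-th terminal node, and $\bar{y}_{R_m}$ the mean response in that region.

$$
R_\alpha(T) = \sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left(y_i - \bar{y}_{R_m}\right)^2 + \alpha\,|T|
$$

With $\alpha = 0$ the largest tree minimizes the criterion; as $\alpha$ grows, each extra terminal node must reduce the squared error by more than $\alpha$ to be worth keeping, so smaller subtrees are preferred.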
When we consider all the variables while building our decision tree, the model quickly becomes overfitted.
The plot above compares the error over the range of α values (cost complexity, shown as the cp value on the bottom x-axis). The upper x-axis gives the number of terminal nodes. Returns diminish after around 10 leaves (dashed vertical line).
Pruning the decision tree to 10 leaves gives a much better model, as seen below.
The above plot confirms that only the first ten variables actually contribute towards reducing the relative error.
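The grow-then-prune procedure can be sketched with the rpart package, which ships with standard R installations. Whether the report used rpart is an assumption, as are the simulated data and the `count_leaves` helper; picking the cp with the lowest cross-validated error is one of several reasonable selection rules.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(5)
n <- 600
# Hypothetical data with a step-shaped (non-linear) signal.
train <- data.frame(
  trip_distance = runif(n, 0, 20),
  trip_duration = runif(n, 1, 60),
  noise_var     = rnorm(n)
)
train$tip_perc <- ifelse(train$trip_distance < 5, 25, 18) + rnorm(n, sd = 4)

# Grow a deliberately large tree (cp = 0 disables the complexity penalty).
full_tree <- rpart(tip_perc ~ ., data = train, method = "anova",
                   control = rpart.control(cp = 0, minsplit = 10, xval = 10))

# cptable holds the cross-validated error (xerror) for each candidate cp.
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)

count_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
cat("leaves before:", count_leaves(full_tree),
    "after:", count_leaves(pruned_tree), "\n")
```

Calling `plotcp(full_tree)` draws the kind of error-versus-cp plot described above, with the tree size on the upper axis.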
Finally, we apply our last model in an attempt to improve the results further. Random forest builds on the classical decision tree using a method called bagging.
Note: Due to limited computational power, the number of trees is limited to 100.
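Bagging itself can be sketched in a few lines: fit one tree per bootstrap resample and average the predictions. This illustration uses rpart trees on simulated data and is plain bagging only; a random forest additionally tries a random subset of variables at each split, which this sketch omits.

```r
library(rpart)

set.seed(6)
n <- 400
dat <- data.frame(x1 = runif(n, 0, 10), x2 = runif(n, 0, 10))
dat$y <- sin(dat$x1) + 0.3 * dat$x2 + rnorm(n, sd = 0.5)   # hypothetical target
train <- dat[1:300, ]
test  <- dat[301:400, ]

n_trees <- 100
# Each tree is fit to a bootstrap resample of the training rows.
trees <- lapply(seq_len(n_trees), function(b) {
  boot_rows <- sample(nrow(train), replace = TRUE)
  rpart(y ~ x1 + x2, data = train[boot_rows, ], method = "anova")
})

# The bagged prediction is the average over all trees.
pred_matrix <- sapply(trees, predict, newdata = test)
bagged_pred <- rowMeans(pred_matrix)

single_tree <- rpart(y ~ x1 + x2, data = train, method = "anova")
single_pred <- predict(single_tree, newdata = test)
mse <- function(actual, pred) mean((actual - pred)^2)
cat("single tree MSE:", mse(test$y, single_pred),
    " bagged MSE:", mse(test$y, bagged_pred), "\n")
```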
##
## Call:
## randomForest(formula = tip_perc ~ trip_distance + tolls_amount + trip_duration + VendorID_1 + VendorID_2 + passenger_count_1 + passenger_count_2 + passenger_count_3 + passenger_count_4 + passenger_count_5 + passenger_count_6 + RatecodeID_1 + RatecodeID_2 + RatecodeID_3 + RatecodeID_4 + RatecodeID_5 + PULocation_Bronx + PULocation_Brooklyn + PULocation_EWR + PULocation_Manhattan + PULocation_Queens + PULocation_Staten_Island + DOLocation_Bronx + DOLocation_Brooklyn + DOLocation_EWR + DOLocation_Manhattan + DOLocation_Queens + DOLocation_Staten_Island + day_Fri + day_Mon + day_Sat + day_Sun + day_Thu + day_Tue + day_Wed + PU_time_of_day_Afternoon + PU_time_of_day_Evening + PU_time_of_day_Morning + PU_time_of_day_Night + DO_time_of_day_Afternoon + DO_time_of_day_Evening + DO_time_of_day_Morning + DO_time_of_day_Night, data = train, ntree = 100, keep.forest = FALSE, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 14
##
## Mean of squared residuals: 50
## % Var explained: 5.1
Node purity is the total decrease in the residual sum of squares from splitting on a variable, averaged over all trees (i.e., how well a predictor decreases variance). Importance summarizes what the model has learned. The above plot shows, for each variable, how important it is in predicting the target. The accuracy-based measure (%IncMSE for a regression forest) expresses how much the prediction error grows when the values of a variable are randomly permuted. The more the accuracy suffers, the more important the variable is for successful prediction. The variables are presented in descending order of importance.
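The idea behind the accuracy-based importance measure can be illustrated by permuting one predictor at a time and recording how much the error increases. This is a simplified sketch on simulated data with a linear model standing in for the forest; the randomForest package computes its measure on out-of-bag samples per tree, which is not reproduced here.

```r
set.seed(7)
n <- 500
dat <- data.frame(
  fare_amount   = rnorm(n, 30, 8),
  trip_duration = rnorm(n, 25, 10),
  noise_var     = rnorm(n)
)
# Hypothetical target driven mostly by fare_amount.
dat$tip_amount <- 0.15 * dat$fare_amount + 0.02 * dat$trip_duration +
  rnorm(n, sd = 0.8)
train <- dat[1:350, ]
test  <- dat[351:500, ]

model <- lm(tip_amount ~ ., data = train)   # stand-in for any fitted regression model
mse <- function(actual, pred) mean((actual - pred)^2)
baseline <- mse(test$tip_amount, predict(model, test))

# Permute one predictor at a time and record the percentage increase in MSE.
predictors <- setdiff(names(train), "tip_amount")
pct_inc_mse <- sapply(predictors, function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])
  100 * (mse(test$tip_amount, predict(model, shuffled)) - baseline) / baseline
})
print(sort(pct_inc_mse, decreasing = TRUE))
```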
Because the random forest performed so poorly (only 5.1% of the variance explained), we decided to exclude it from our model selection.
The pruned decision tree is the best model when aiming for a low MAPE and a high R-squared. When comparing the different models, the MAPE is broadly similar, ranging from about 0.18 to 0.23, which corresponds to roughly 77-82% accuracy. But all of the models' R-squared values are quite low, explaining only about 5% to 10% of the variance in our dependent variable. This suggests that none of the models is a thorough or accurate fit.
We then repeated the same sequence of regression-based methods with tip amount as the target variable, again starting with PCA as a variable reduction strategy.
Variables graph: Variables that are positively associated point to the same side of the plot. Negatively associated variables point to the graph's opposing sides.
Observations: Even though the first three components explain 94% of the variance in the data, this does not necessarily mean that a good R-squared or large coefficients will result. However, it gives us a sufficient statistical basis for choosing which variables to pursue. Hence, we proceed to build our linear regression model with these variables.
Based on the results of the principal component analysis, we constructed a linear model for tip amount using the three variables that explain the most variability. Their correlations with tip_amount are: fare_amount (0.65), trip_distance (0.54), and trip_duration (0.36).
ANOVA tests on all three models
The summary of the three linear models is:
| Fit | Model Equation | R^2 | ANOVA P-value | AIC |
|---|---|---|---|---|
| 1 | tip_amount ~ trip_duration | 0.13 | – | 172650.796 |
| 2 | tip_amount ~ trip_duration+fare_amount | 0.432 | 0 | 172813.743 |
| 3 | tip_amount ~ trip_duration+fare_amount+trip_distance | 0.433 | 0.0000000000000000000000000000000000000944 | 205075.254 |
Observations: Looking at the combination of p-value and R-squared, we conclude that fit 3 performs slightly better than the other two fits. Hence, we check whether model 3 improves in the absence of outliers.
Even after treating the outliers in the model 3 fit, there is little to no difference in the results: the R-squared and MAPE values remain 0.439 and 4.662.
We again fit a lasso regression (L1 regularization), iterating through a set of lambda values to find the optimum. The graph below shows how lasso reduces the coefficients of unnecessary attributes to 0. For readability, only the five attributes with the largest coefficient values are labeled.
As expected, the fare amount survives the longest. However, it is surprising how long the toll amount and the Bronx drop-off location persist.
The lambda value that minimizes the test MSE is 0, meaning that essentially no penalty is applied.
As before, the R-squared is around 0.4463, which leaves a lot of room for improvement.
We then fit a ridge regression for tip amount, again adjusting the lambda parameter (penalizing factor) so that the least effective coefficients shrink the quickest.
Again, only the five attributes with the largest coefficient values are labeled for readability.
Observations: The plot shows the full path of the coefficients as they shrink towards zero as lambda increases. The Staten Island pick-up location and the Nassau or Westchester rate (Rate Code 4) survive the longest.
The lambda value that minimizes the test MSE is 0.282.
The plot shows that all the variables together explain ~42% (the ~0.4288 point on the plot) of the variance in the data. This is consistent with the R-squared value of the model.
As with tip percentage, we fit a CART regression tree for tip amount, growing a very large tree and then pruning it back with the cost complexity parameter (α), which penalizes the number of terminal nodes in the tree.
When we consider all the variables while building our decision tree, the model quickly becomes overfitted.
The plot above compares the error over the range of α values (cost complexity, shown as the cp value on the bottom x-axis). The upper x-axis gives the number of terminal nodes. Returns diminish after around 13 leaves (dashed vertical line).
Pruning the decision tree to 13 leaves gives a much better model, as seen below.
Finally, we apply our last model in an attempt to improve the results further. Random forest builds on the classical decision tree using a method called bagging.
##
## Call:
## randomForest(formula = tip_amount ~ trip_distance + fare_amount + tolls_amount + trip_duration + VendorID_1 + VendorID_2 + passenger_count_1 + passenger_count_2 + passenger_count_3 + passenger_count_4 + passenger_count_5 + passenger_count_6 + RatecodeID_1 + RatecodeID_2 + RatecodeID_3 + RatecodeID_4 + RatecodeID_5 + PULocation_Bronx + PULocation_Brooklyn + PULocation_EWR + PULocation_Manhattan + PULocation_Queens + PULocation_Staten_Island + DOLocation_Bronx + DOLocation_Brooklyn + DOLocation_EWR + DOLocation_Manhattan + DOLocation_Queens + DOLocation_Staten_Island + day_Fri + day_Mon + day_Sat + day_Sun + day_Thu + day_Tue + day_Wed + PU_time_of_day_Afternoon + PU_time_of_day_Evening + PU_time_of_day_Morning + PU_time_of_day_Night + DO_time_of_day_Afternoon + DO_time_of_day_Evening + DO_time_of_day_Morning + DO_time_of_day_Night, data = train, ntree = 100, keep.forest = FALSE, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 14
##
## Mean of squared residuals: 0.602
## % Var explained: 40.2
As before, the above plot shows, for each variable, how important it is in predicting the target, measured by node purity and by how much the accuracy suffers when the values of the variable are permuted. The variables are presented in descending order of importance.
As seen in the above plot, trip duration and fare amount would have the greatest impact on the model if they were removed.
For tip amount, the pruned decision tree has the lowest MAPE (2.049), while lasso has the highest R-squared (0.4463). The MAPE varies widely across the models, from about 2.0 to 8.1, and the R-squared values range from about 0.34 to 0.45, so the models explain less than half of the variance in our dependent variable. This suggests that the models are neither thorough nor accurate fits.
| Technique | Dependent variable | MAPE | R-squared | AIC |
|---|---|---|---|---|
| Linear(3 vars with best cor-coeffs) | tip_perc | 0.185 | 0.0835 | 513236.638877281 |
| Linear-treated outlier | tip_perc | 0.185 | 0.0835 | 509392.626959756 |
| Lasso | tip_perc | 0.183 | 0.0991 | -372632.599400254 * |
| Ridge | tip_perc | 0.183 | 0.0930 | -344041.945716197 * |
| Decision Tree | tip_perc | 0.226 | 0.0463 | – |
| Decision Tree (Prune) | tip_perc | 0.183 | 0.1020 | – |
| Linear(3 vars with best cor-coeffs) | tip_amount | 4.662 | 0.4394 | 205075.253842576 |
| Linear-treated outlier | tip_amount | 4.662 | 0.4394 | 172650.795971072 |
| Lasso | tip_amount | 6.322 | 0.4463 | -33544.6455051331 * |
| Ridge | tip_amount | 8.136 | 0.4288 | -31818.3532968901 * |
| Decision Tree | tip_amount | 6.521 | 0.3444 | – |
| Decision Tree (Prune) | tip_amount | 2.049 | 0.4349 | – |